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Abstract. The popularity of the EM algorithm owes much to the 1977 
paper by Dempster, Laird and Rubin. That paper gave the algorithm 
its name, identified the general form and some key properties of the 
algorithm and established its broad applicability in scientific research. 
This review gives a nontechnical introduction to the algorithm for a 
general scientific audience, and presents a few examples characteristic 
of its application. 
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counting, linkage analysis, finding regulatory motifs, diffusion batteries, 
particle size distributions. 
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1. INTRODUCTION 

Incomplete data arise in many different settings 
in the empirical sciences. An obvious example of in- 
complete data is missing data, where multiple mea- 
surements are made on each subject, but some sub- 
jects are not observed on all measurements. Many 
applications are more subtle and consist of problems 
where the observed data only shed light on "hid- 
den" or "latent" traits which are of primary inter- 
est. This frequently occurs in the engineering setting 
with reconstruction or indirect measurement, such 
as medical imaging with emission and transmission 
tomography. The example from public health that 
is diagrammed in Box 4 is another example of in- 
direct measurement. Many applications involve lo- 
cating clusters of observations with similar features; 
distinguishing features are functions of the observed 
data, but cluster membership must be inferred. Such 
problems occur prominently in bioinformatics and 
speech recognition. Finally, some statistical models, 
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such as variance components or random effects, can 
be reformulated as missing data problems simply to 
make computations easier, even though no data are 
missing. 

Here we discuss the EM algorithm (Dempster, Laird 
and Rubin, 1977, henceforth DLR), which is de- 
signed for computations in a broadly defined incom- 
plete data setting; it is widely used in many different 
areas in the empirical sciences. There are numerous 
applications (Smith, 1957; Hasselblad, 1966; Baum 
et al., 1970; Orchard and Woodbury, 1972, to men- 
tion a few) of the algorithm that predate the DLR 
paper, but that paper described some key features 
of the algorithm that underlie its widespread pop- 
ularity. First was the recognition of the generality 
of the algorithm, and a probabilistic definition of 
incomplete data that can be applied very broadly 
in different settings. Secondly, the paper provided 
a simple and intuitive description of the algorithm 
and named it "Expectation-Maximization," or EM 
for short, to reflect the two steps that comprise its 
essential nature. There are many technical descrip- 
tions of the algorithm and its properties (see, e.g., 
McLachlan and Krishnan, 1997), and a variety of 
generalizations. The purpose of this note is to pro- 
vide an intuitive description of how the algorithm 
works and give four short examples from genetics, 
genomics and public health. 

We first give a heuristic characterization of the al- 
gorithm; the remainder of the introduction discusses 
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its formulation in more detail. The essential idea of 
the EM is to postulate the availability of additional 
data (e.g., the values of the missing measurements in 
the missing data setting) that make the estimation 
problem easy. The EM then proceeds by alternat- 
ing between two steps: "fill in the additional data 
(E-step)" and "estimate the parameters using the 
filled in data (M-step)." This two-step process is re- 
peated until convergence. DLR put the algorithm on 
a rigorous foundation by spelling out how the two 
steps should be implemented in a general setting and 
showing that it maximizes an objective function. 

To describe the algorithm, the first task is to iden- 
tify a "complete data" version of the problem. This 
is often the most creative part of the application, 
since once a complete data analogue has been iden- 
tified for the observed data, the application of the 
EM is straightforward. There generally will not be 
a unique representation for the complete data, but 
often there is an "obvious" one. For example, in the 
case of missing data, the complete data is best de- 
scribed as the observed data, plus the missing ob- 
servations; in this way, the complete data specifies 
a complete set of measurements for each subject. 
In other settings, the complete data will consist of 
the data that are observed plus some additional in- 
formation that would make the problem easy; the 
bioinformatics example we describe in Box 3 and 
clustering examples in general fall into this category. 
At one extreme, the complete data is sometimes best 
described as just the missing data, because the ob- 
served data provides no additional information when 
the missing data are available. As an example, con- 
sider our first example (Box 1) of estimating gene 
frequencies. It would be straightforward to estimate 
the frequencies of different gene variants (termed al- 
leles), if we directly observed individual genotypes; 
in the absence of genotype information, we use data 
on traits related to the genotype. Here the complete 
data are the observed genetic traits and the geno- 
types of individuals in the sample (Box 1), but if we 
know the genotypes, the observed traits contribute 
no additional information about gene frequencies. 

Having identified the complete data, we can now 
delineate the E- and M-steps of the algorithm. Al- 
though logically the E-step precedes the M-step in 
the computations, hence EM, conceptually it is eas- 
ier to define the M-step first. 

M-step: The way we define the complete data de- 
termines the M-step of the algorithm. At the M-step, 



we obtain our estimates of the parameters of interest 
assuming we have observed the possibly hypotheti- 
cal "complete data." Formally the M-step (for Max- 
imum likelihood) uses the complete data to obtain 
maximum likelihood estimates of the parameters. In 
many instances, this will be simple, familiar statis- 
tics, that is, means, variances and covariances, or 
proportions; each of the four examples we will dis- 
cuss is based on a complete data multinomial likeli- 
hood and just requires estimating probabilities from 
sample frequencies at the M-step. Exactly how this 
M-step is carried out depends upon the application, 
but it is worth noting that the ease of computations 
depends in large measure on defining the complete 
data so that performing the M-step is easy. 

E-step: Once the M-step is done, we have interim 
estimates of the relevant parameters which can now 
be used, along with the observed data, to calculate 
expected values for the "missing data," or techni- 
cally, computing the expected log-likelihood of the 
complete data. Again, the exact nature of the E-step 
(for computing the Expected log-likelihood) is ap- 
plication dependent; in each of the examples we dis- 
cuss, the E-step involves the computation of con- 
ditional probabilities. These two steps are iterated 
until convergence. Although the algorithm is not 
guaranteed to maximize the likelihood function, it 
has some attractive numerical properties. These in- 
clude increasing the likelihood at each iteration and 
a guarantee that the parameter estimates will re- 
main in the boundary space, that is, probabilities 
will always be between and 1, and variance-cova- 
riance matrices will be positive semi-definite. 

2. EXAMPLES FROM GENETICS 

The genetics literature is replete with examples 
of the EM. Before genotypes were readily available 
via modern technologies, the EM was often used to 
estimate gene frequencies from data on associated 
Mendelian traits. An individual's genotype consists 
of a pair of alleles, one inherited from each par- 
ent. Even when the genotypes of individuals are ob- 
served, there are still many important estimation 
problems with naturally occurring incompleteness, 
especially if it is important to determine which al- 
lele is inherited from which parent. One example is 
the reconstruction of haplotypes, that is, the set of 
alleles at different loci all lying on the same chro- 
mosome, from pairs of alleles at the different loci. 
A second example that we will discuss is estimating 
allele sharing in a pair of affected siblings. 
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Gene Counting: Estimating the allele frequencies at the ABO 
hemoglobin locus 



Saa Rao 




Sab 


S(M1 



Missing Data: 

Genotype counts 

Observed Data: 
Blood Type Counts 

Complete Data: Genotypes and blood type counts. 

M-Step: I ■ stimuli: allele frequencies using complete data 

Jo + (g,H, + 2 gBO + Soii 1 2n > 

where n is the total number of people, and 2n is the number of 
alleles. 

E-Step: Compute expected values for die unobserved genotypes 
E(g u ) = !, P(AA | AA or AO) 
E(g ltl ) = t 4 P{AO\AA or AO) 
= t g P(BB\BB or BO) 
E(g.,u) = V P(BO\BB or BO), 

where expectations are not needed for g 4B and g flfl since they are 
uniquely determined by blood type. See Box I B for details 



Gene Counting 

Box 1 illustrates using the EM algorithm for es- 
timating the three allele frequencies at the ABO 
hemoglobin locus. When the genotype data are di- 
rectly observed, estimation is referred to as gene 
counting, because one simply counts the number of 
alleles of each type, and divides by the total number 
of alleles. Gene counting with the observed genotype 
data is shown in the M-step of Box 1, where the 
number of individuals (possibly unobserved) with 
each genotype is denoted as qaa, 9AB-, etc., and the 
observed number of individuals with each blood type 
are denoted as tA, ts, etc. For an autosomal chro- 
mosome, each person contributes two alleles, hence 
the denominator of each estimated frequency is 2n, 
where n is the total number of subjects. 

The E-step takes into account the known relation- 
ships between blood type and genotype given in the 
top of Box 1, namely a person with blood type A 
must be either AA or AO, and similarly for blood 
type B, but blood types AB or O identify genotypes 
uniquely. For the E-step, we assume the allele fre- 
quencies are known and fixed at the values estimated 
at the previous M-step, and use them to calculate 
the P(genotype (blood type). 

The EM algorithm consists of cycling through the 
two steps, alternately estimating the allele frequen- 
cies assuming the allele counts are observed, then 



Box IB Gene Counting: A Numerical Example 



Observed Blood Types: I, =200, t„ = 50, t M =40, /„ = 300 

Initialization: /, = 1/3, /„ = 1/3, f„ = 1/3 
Iteration Step Bao 8bb Kbo' f.t fa r o 



initialize .3333 .3333 .3333 

1 E 66,6667 133.3333 20 40 : 

M ,2556 3 .1000 .6444 

2 E 33.0935 166.9065 4.3200 55.6800 

M .2276 .0869 .6855 

3 E 28.4762 171.5238 3.5766 56.4234 

M .2237 .0863 .6900 

4 E 27.8980 172.1020 3.5325 56.4675 

M .2237 .0863 .6900 



Notes: 

1 . Note thai g,„ - I Ja and g lx) - l a are observed and do not change. 

2. The genotypes are computed at the F,-slep using the frequencies 
from the preceding M-step (or initialization), i.e. 

tm = i,{p(BO)/[p{BO)+ p(BB)]} 
40 = 60|2 ( l/3)y[2 (1/3) ! + ( 1/3)" ]} = 60 ( 2/3) 

3. The allele frequencies are estimated at the M-step using the 
expected genoty pe counts computed at the preceding E-slep and 
formulas given in Box 1 . i.e. 

.2556= (66.6667 + 2x1 33.333 + 40)/l200 



updating the expected allele counts, assuming that 
the frequencies are known. We can start with either 
the E- or the M-step, depending upon whether it is 
easier to start with a guess about the frequencies 
or with a guess about the genotype counts. Note 
that in the iterations, the genotypes used at the M- 
step are expected, computed at the previous E-step; 
the allele frequencies used to compute the condi- 
tional expectations at the E-step are likewise those 
updated at the previous M-step. 

Box IB provides a numerical example to illustrate 
the algorithm, using hypothetical, but not unreal- 
istic blood counts from a sample of 600 subjects. 
We start the iterations by setting the frequencies to 
be 1/3 each. Computing the expected genotype fre- 
quencies at the E-step uses the assumption of Hardy 
Weinberg equilibrium to calculate the probability of 
genotype frequencies, given allele frequencies. Hardy 
Weinberg assumes that allele frequency is the same 
for everyone in the population and random mating, 
hence the two alleles of an individual are indepen- 
dent. Simple inspection of the observed data sug- 
gests that our initial frequencies are not very good 
estimates, but the algorithm converges in only a few 
iterations. For actual data examples and further dis- 
cussion of using the EM for gene counting, see Lange 
(2002), Chapter 12. 
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Linkage Example 

Linkage analysis is widely used to find the chro- 
mosomal location of a hypothesized gene affecting 
some trait of interest. Simply put, linkage refers to 
the relative position of two genetic locations; if they 
are physically "close" on the same chromosome, the 
two loci are said to be linked. If they are on the 
same chromosome, but distant, or if they are on dif- 
ferent chromosomes, they are said to be unlinked. 
The EM has many applications in linkage analysis; 
here we consider its use in estimating allele shar- 
ing of affected siblings. When two loci are unlinked, 
Mendel's laws of inheritance holds independently for 
the two loci, and one can easily calculate the proba- 
bility of allele sharing between two siblings at a ge- 
netic locus. In particular, the probability that both 
siblings inherit the exact same alleles from both 
their parents (allele sharing is 2) is |. The prob- 
ability that they each inherit two different alleles 
from both parents (or allele sharing is zero) is also 
4, and the probability that they inherit the same 
allele from exactly one parent (allele sharing is 1) is 
^. This type of allele sharing is called identity-by- 
descent (IBD) to indicate that the alleles are shared 
because the sibs have obtained the same copy from 
a common parent. 

To implement a linkage analysis, we obtain data 
on a genetic locus, called marker data, that we hy- 
pothesize is close to a genetic locus that affects our 
disease of interest. If both siblings are affected with 
the same genetic disease, and the marker and the 
disease locus are linked, we expect they are likely 
sharing at least one allele, inherited from the same 
parent, at the unobserved disease locus. Further, 
if the locus underlying the disease is linked to the 
marker, independent transmission at the 2 loci does 
not hold. Rather, we expect that the allele shar- 
ing probabilities at the marker differ from j, i, |, 
in the direction of increased sharing. To test this, 
we estimate the probabilities of sharing 0, 1 or 2 
alleles under Ha, vro, tt\, tt2 say, from the sample 
of affected sib pairs and compare them to the allele 
sharing probabilities under the null hypothesis of no 
linkage, |, ±, \. 

Estimation of the allele sharing frequencies un- 
der the alternative of linkage between the marker 
and the disease locus is straightforward if we could 
observe the IBD sharing directly; we would simply 
count the number of affected sibs pairs sharing 0, 
1 or 2 alleles, and divide by the number of affected 



sib pairs. This is illustrated in the M-step of Box 2 
where the complete data IBD counts are denoted as 
Zo, Z\ and Z<i- As illustrated in Box 2, one cannot 
always infer IBD sharing from data on parental and 
offspring genotypes; it depends upon the pattern of 
alleles which are observed. The "complete data" il- 
lustrates a setting where we can always deduce IBD 
sharing, that is, the parents have four distinct alle- 
les. The "observed data" shown in Box 2 illustrates 
two situations where we cannot. In general, the "ob- 
served data" are a pattern of sharing, for example, 
either or 1, 1 or 2, etc., together with the ob- 
served parental genotypes. The M-step is based on 
a multinomial likelihood, assuming that we observe 
IBD for each sib pair, and the E-step provides expec- 
tations of the multinomial counts, by computing the 
expected allele sharing for each pair conditional on 
their observed pattern of sharing, and adding over 
pairs. The conditional probability of IBD sharing is 
obtained from Bayes rule 

P(LBD = j|observed data) 

cx P(observed data|IBD = j)-Kj for j = 0, 1,2, 

where ttj has been estimated at the M-step; and 
^(observed data|IBD = j) is calculated from the ob- 
served data on parents and child's genotypes (Risch, 
1990). 

Risch (1990) also extends the EM to the setting 
where parental data are missing, but now one must 
first have an estimate of parental allele frequencies. 
Kruglyak et al. (1995) extended the algorithm to 
cover multipoint analyses with complex pedigrees 
as well as additional markers, but the basic idea of 
using EM to estimate IBD probabilities under Ha 
remains the same. 

3. EXAMPLE FROM COMPUTATIONAL 
BIOLOGY: FINDING MOTIFS 

Computational Biology deals with the analysis of 
data that comes from sequencing the DNA of hu- 
mans and other organisms. The specific sequence 
of the four nucleotides which make up DNA, Ade- 
nine, Guanosine, Thymine and Cytosine, or A, G, 
C, T, for short, determine the function of genes 
and location of genes, the process of RNA transcrip- 
tion, and the manufacture of proteins essential for 
cell function. Understanding many fundamental bio- 
logical processes requires tools to identify relatively 
short patterns of these four base pairs embedded in 
long strings of base pairs (approximately 3 billion in 
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Complete Data: Number of alleles shared IBD for each pair of sibs: 
AB]-j-@ [a!^(CD ) [ab|-|H^D ) 

(a£) (AC) (ac) (AD) (ax) (BD) 



IBD = 2 IBD = I 

Observed Data: Pattern of allele sharing: 



Hit; >> 




AA -riCD) AB -riAB 




(AC?) (At)) (AB) (AB) 



IBD = or I 



IBD = 2orO 



M-Step: Estimate IBD sharing, assuming IBD is known for each 

pa "' jr„ = Z (l /«, n t =Zjn, x 2 =Z 2 jn 

where 11 is the number of sib pairs, and 2 n , Z,, Z, are counts of 
pairs sharing 0, I, or 2 alleles. 

R-Stcp: Compute the expelled counts ofallele sharing, e.g.: 
for j = 0, 1 , 2 

£^2, ) = £ P[i'* pair shares j alleles | observed data.) 
where summation is over all sib pairs and 

P^i" 1 pair shares/' [observed data) ct P (observed data|sharingy);r, . 



Complete Data: Seven DNA fragments, with the motifs underlined: 

C'GGGGCTATCCAGCTGGGTCGTCALATCCCCrr 
TGCCCAATAA GOOCAACT CCAAAGGGOACAAA 
GG ATtiCiArCT GATCVVVCiTTTGACGACCTA 
AAGG AAGCAACCX CAGGAGCCCTTIGCTC 
A AT I' ITC'1 AA A AAGATTATAATGTCX >(i IC VITt ■GAACT IV 
CTGCTGTACAACTGAGATCATGCTGC ATCiCCAT ITCAAC 
ACATGAiCTmG ATGGCACTT GGATCAGGGAAIGAIGC 
Observed Data: The sequences without the motifs show 11: 
CGGGGt'TATCCAGCI'GGGTCGTGACATCCCCTT 
lire* \AI AAGGtifAAl let VAACit'tKJAt'AAA 
□GATGGA I CTGATGVV VGTT1 GACGACCTA 
v\:.i.\\< ! w<;< 1 Aliti win r. rrmt'K 
AA'I "I TK'l "AAAAAG ATI "ATAATOTCGGTCCTI "GGAAC ITC 
CTGCrGTACAACTGAGATCATXiCrGCATCICCATTTCAAC 
AC A TG ATCTTTTGATGGC ACTTGG ATG AGGGA ATGATGC 
M-Step: Estimate Ihe probability of each base pair in each motif 
position, assuming known starting position, i.e., P(A J position 1 ) = 
5/7, etc.. 





Background 






Motif Position 






Letter 


PostHon 


1 




3 4 5 S 


7 


9 


A 


48 


5 


! 


5 5 








c 


43 


D 


a 


14 2 


6 


g 


a 


46 


1 


1 


S 3 1 








r 


46 




s 


1 


1 


8 


Totals 


144 


:■' 


? 


? 7 7 7 


7 


/ 



E-Step: Compute expected counts needed for (he M-stcp: 

Expected #A's in position I 

= £E/ J (A in position I | motif starting point, observed data) 
Where Ihc summation is over all possible starting positions wilhin a 
fragment, and over all fragments. 

Data modified from Jones and Pevzner ( 1 985. f igure 4.3). 



the entire human genome). The problems are chal- 
lenging and relevant to statisticians because the se- 
quences of interest may not be "exact." For example, 
the simple sequence consisting of two specific base 
pairs, CG, is relatively rare in the DNA of many 
organisms, because CG readily mutates to TG. But 
mutation is suppressed in regions near specific genes, 
forming CG rich islands, or stretches of DNA that 
have more CG pairs than "usual." One approach 
to identifying GC rich islands is to treat strings of 
DNA as realizations of Hidden Markov Models, with 
different (unobserved) states corresponding to GC 
rich or GC poor regions (Parida, 2008, Chapter 5.5; 
Jones and Pevzner, 2004, Chapter 11). An EM algo- 
rithm to estimate the state and transmission prob- 
abilities was developed by Baum et al. (1970). 

A related problem in computational biology where 
EM is used is the identification of regulatory mo- 
tifs. Motifs are short sequences of base pairs, from 
6 to 20 pairs in length, which have a similar pattern 
of base pairs. Proteins bind to functional motifs lo- 
cated upstream of genes to encourage the process 
of RNA transcription in the genes. Given a set of 
known genes, the approximate location of the corre- 
sponding functional motifs is known, however their 
exact sequences vary because the protein binding 



process does not require an exact sequence of base 
pairs. 

The basic idea is illustrated in the top panel of 
Box 3. Seven hypothetical DNA fragments are given 
an input data (Jones and Pevzner, 2004, Chapter 
4). The underlined portion of each sequence denotes 
the actual (unobserved) motif. Here we assume there 
is only one motif in each input sequence, and it is 
known to be exactly eight base pairs long. The ex- 
act DNA letters vary from fragment to fragment, 
but two motifs are "more similar" than two ran- 
domly selected sequences of eight DNA letters. The 
problem can be defined probabilistically by assum- 
ing that the probability corresponding to a letter 
in a motif location is the same for every motif, but 
differs from nonmotif, or "background" DNA. The 
objective is to characterize the general pattern of 
DNA for the motifs as a "consensus" sequence. 

A purely computational approach to this problem 
is to pick a metric for measuring similarity between 
eight letter sequences, such as the number of posi- 
tions that have the same letter across all fragments, 
and then search for the set of eight letter sequences 
that optimizes the metric. This is a time consuming 
search, since each possible alignment has to be con- 
sidered, and each fragment of X letters has X — 7 



6 



N. M. LAIRD 



possible starting points for the eight letter sequence; 
additionally it is difficult to measure optimality of 
the solution. 

Alternatively, one may utilize a probability model 
for the fragment data, and use the EM (Lawrence 
and Reilly, 1990). The data can be modeled as in- 
complete because the motif locations are not ob- 
served. Conceptually, the missing data are seven in- 
dicator vectors, indicating the starting point of each 
motif in each of the seven fragments. The parame- 
ters to be estimated are the frequencies of the four 
DNA letters in the motif and nonmotif positions. 
We assume a 4 x 8 matrix of multinomial probabil- 
ities, one column for each position in a motif. Each 
element of the column gives P(A), P(C), P{G) and 
P(T) for the DNA letter in each position of the mo- 
tif. If we knew the location of each motif in each 
fragment, we would estimate these probability vec- 
tors via simple multinomial frequencies for each po- 
sition of the motif as is illustrated in the M-step of 
Box 3. 

We do not know the starting points of each motif, 
but given the motif and background probabilities, 
one can readily calculate the conditional probability 
of any possible starting position for each fragment, 
by assuming a priori that all possible starting val- 
ues are equally likely and evaluating the probability 
of each DNA sequence for each assumed starting 
point. These probabilities of each starting point for 
each motif are then used to compute the expected 
multinomial counts needed for the M-step, as shown 
in the E-step of Box 3. 

At the conclusion of the iterations of the EM, we 
have the matrix of estimated base pair probabilities 
for each location. These can be used to compute a 
"consensus sequence" by taking the base pair with 
the highest frequency in each location of the motif. 
For example, based on the frequencies computed at 
the M-step in Box 4, the consensus sequence would 
be ATGCAACT. At the conclusion of the EM we 
also have the probability of each alignment (deter- 
mined by the estimated starting points). This has 
been used to assign motif locations to specific se- 
quences. While the motif sequence frequencies can 
be well estimated with a large number of fragments, 
in this formulation there is no simplifying model for 
alignment probabilities, and the number of possibil- 
ities grows exponentially with the number of frag- 
ments. Hence alignments are unlikely to be well es- 
timated. This has led to interest in Bayesian ap- 
proaches based on Gibbs sampling (Lawrence et al., 



1993). The application of the EM described here, as 
in many other settings, is sensitive to starting val- 
ues; see Parida (2008), Chapter 8.6. Generalizations 
of the simple case discussed here which allow multi- 
ple types of motifs and as well as multiple numbers 
per fragment have been given in Cardon and Stormo 
(1992) and Bailey and Elkan (1995). 

4. EXAMPLE FROM PUBLIC HEALTH: 
MONITORING AIR QUALITY 

Many harmful exposures, radon or diesel emis- 
sions for example, are characterized by having parti- 
cles with very small diameters (less than 0.4 microm- 
eters). Particles this small cannot be directly mea- 
sured, but having estimates of particle sizes are im- 
portant for monitoring air quality. Several measure- 
ment devices have been developed to deal with this 
problem; here we discuss diffusion batteries which 
operate on a principle of indirect measurement. The 
basic principle is the same used for positron emission 
tomography (PET scans) as well as transmission to- 
mography (Vardi, Shepp and Kaufman, 1985; Lange 
and Carson, 1984; Kay, 1997). 

Diffusion batteries are designed to filter out parti- 
cles of different sizes by passing a volume of aerosol 
through a succession of fine wire mesh screens, and 
counting the number of particles remaining in the 
aerosol at each stage. Figure 1 illustrates how the 
diffusion battery works. 

A fixed volume of air is drawn in through the en- 
trance port, with only one of the exit ports open. 
The total number of particles passing through the 
exit port is counted. Thus the observed data con- 
sists of 11 counts of particles. The "zero port" counts 
the total number in the volume of aerosol regardless 
of size, since there are no screens before the zero 
port. The subsequent ports have differing numbers 
of screens which increase the probability that par- 
ticles of different sizes are trapped in the screens, 
and are thus not counted. The smaller particles are 
more likely to be removed at the early stages, since a 
particle's diameter determines how fast they move. 
The smaller particles are moving faster and are more 
likely to hit a barrier (a screen), and become trapped 
at the earlier stages. The larger particles are slug- 
gish; they tend to fall through the battery, only be- 
coming trapped at the end stages of the battery 
where there are many more screens. We estimate the 
distribution of particle sizes from the total particle 
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Fig. 1. Diagram of a diffusion battery. Reproduced from 
the TSI Instruction Manual for Diffusion Battery Models 
3040/3041. 

counts measured at each port by dividing the par- 
ticle size distribution into intervals, and estimating 
the proportions in each interval. 

A natural way to formulate an incomplete data 
problem in this setting is illustrated at the top of 
Box 4. We define complete data as a 2-way array 
of counts of particles in size category j exiting at 
the ith port, Zij, where i = 1,10 and j = 1,8 in 
our example. The unobserved can be modeled as 
independent Poisson counts with E(Zij) = PoWijfj, 
where fj are the frequencies of the jth size category, 
and 



W; 



P(particle of size j exits at the ith port). 



The Wij are calculated from the known character- 
istics of the diffusion battery. The observed counts 
exiting each port are just the row totals of the Zij, 
Pi, plus the count at the zero port, Pq. We note 
that under this set up, the expected values of the 
observed counts follow a simple linear model: 



E(P i ) = P J2w ij xf j . 



Thus estimation can be treated linear regres- 
sion problem where the /j's are the coefficients to 
be estimated and the itfy-'s are the known predic- 
tors. Because the /j's are constrained to be posi- 
tive, and the errors are not normally distributed, 
ordinary least squares does not work well. Typically, 
non-negative least squares or Ridge regression have 
been used as alternatives, but using the EM to ob- 
tain maximum likelihood estimates under the Pois- 
son model described below represents a substantial 
improvement (Maher and Laird, 1985). 

The use of the EM for this application is illus- 
trated in Box 4. The column totals of the array, 
Nj, give the number of particles in a given size cat- 
egory which exit at all ports combined. Only the 
JVj's are needed for the M-step, but taking the full 
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Complete Data: Array of size classified counts (Z's) for each port 
(excluding the port): 

Midpoint of Size Interval (in micrometers) 
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Observed Data: Total counts at each port (including the ?.ero port I: 

P P P P 

M-Slcp: I .si iniaW ilie frci|uciic> of particles in each si/e interval 
Using weighted counts: 

f r =N f fP t xW t , ror/=l,„.,8 

where W f = /'(particle in interval j is counted at any port): this 
probability is calculated from the properties of the diffusion battery . 

E-Step: Compute expected counts of the complete data array as: 

where / V >=< ) 

w t j — P( particle in interval / exits at port /) and W t = ^ nv 

M 

Obtain the E(Nj) = £e(z, ( ) 



array of counts as complete data simplifies the cal- 
culations. The proportions, say fj, in each size cat- 
egory are estimated from the weighted frequencies 
fj = Nj/P x Wj. We use weights, 

Wj = P(particle of size j is counted at any port) 

10 

i=i 

because the complete data counts have been filtered 
according to size. 

At the E-step, we compute the expected values of 
each Z^, conditioning on the observed row margins 
and assuming the fj's are known. 

Conclusion 

This paper merely skims the surface of the multi- 
tude of applications in diverse scientific areas where 
the EM plays an important role, not just in the com- 
putations, but in the conceptualization of the prob- 
lem as well. This volume provides additional exam- 
ples from other areas of the empirical sciences. 
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